Fix GrpcChannel handle leak in AzureManaged backend #625
- Add GrpcChannelCache for thread-safe channel caching by endpoint
- Update Client/Worker extensions to use shared cache
- Ensure channels are disposed when ServiceProvider disposes
- Add comprehensive unit and integration tests
I asked Copilot CLI to review this with the following prompt since the currently proposed solution looks far more complicated than I was expecting.
Here's what the response was:
I can try to review this more deeply tomorrow.
- Move channel factory call outside lock to prevent deadlock
- Combine nested if statements in Replace method
- Use 'using' statement for channel disposal
- Catch Exception instead of bare catch
- Remove unused variable in test

- Remove separate GrpcChannelCache class
- Inline channel caching directly in ConfigureGrpcChannel using ConcurrentDictionary<string, Lazy<GrpcChannel>>
- Make ConfigureGrpcChannel implement IDisposable for proper channel disposal
- Remove unused Replace() and TryRemove() methods
- Add disposal verification tests
- Reduces complexity from 170+ LOC to ~40 LOC per extension

- Use LINQ Where() instead of if inside foreach for filtering channels
- Narrow catch (Exception) to specific types (OperationCanceledException, ObjectDisposedException)
options.Channel.Should().NotBeNull();

// Dispose the service provider - this should dispose the ConfigureGrpcChannel which disposes channels
provider.Dispose();
I see we're disposing the provider but there's no code that actually checks to see if the channel was disposed. I don't think we can rely on this test to know whether the disposal code is working as intended.
{
    // ShutdownAsync is the graceful way to close a gRPC channel.
    // Fire-and-forget but ensure the channel is eventually disposed.
    _ = Task.Run(async () =>
Why are we doing fire-and-forget here? This seems dangerous from a reliability perspective and makes it much harder to author reliable tests.
/// <param name="schedulerOptions">Monitor for accessing the current scheduler options configuration.</param>
class ConfigureGrpcChannel(IOptionsMonitor<DurableTaskSchedulerWorkerOptions> schedulerOptions) :
    IConfigureNamedOptions<GrpcDurableTaskWorkerOptions>
sealed class ConfigureGrpcChannel : IConfigureNamedOptions<GrpcDurableTaskWorkerOptions>, IDisposable
Are we able to use IAsyncDisposable instead of IDisposable? The gRPC channel uses an async method to shutdown so it would be better if our dispose code could call it in the proper async way.
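The question above can be illustrated with a minimal sketch of awaited, idempotent async disposal. It is not the PR's code: `FakeChannel` and `ChannelOwner` are hypothetical stand-ins (no gRPC package is referenced), and the only thing borrowed from the discussion is the idea of awaiting a `ShutdownAsync()` step before `Dispose()` instead of running it fire-and-forget.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a resource with an async shutdown step.
public sealed class FakeChannel : IDisposable
{
    public bool ShutDown { get; private set; }
    public bool Disposed { get; private set; }

    public Task ShutdownAsync()
    {
        this.ShutDown = true;
        return Task.CompletedTask;
    }

    public void Dispose() => this.Disposed = true;
}

// Hypothetical owner type; Add is not thread-safe in this sketch.
public sealed class ChannelOwner : IAsyncDisposable
{
    readonly List<FakeChannel> channels = new();
    int disposed; // 0 = live, 1 = disposed

    public FakeChannel Add()
    {
        if (Volatile.Read(ref this.disposed) == 1)
        {
            throw new ObjectDisposedException(nameof(ChannelOwner));
        }

        var channel = new FakeChannel();
        this.channels.Add(channel);
        return channel;
    }

    public async ValueTask DisposeAsync()
    {
        // Interlocked.Exchange makes disposal idempotent: only the first caller proceeds.
        if (Interlocked.Exchange(ref this.disposed, 1) == 1)
        {
            return;
        }

        foreach (FakeChannel channel in this.channels)
        {
            await channel.ShutdownAsync(); // awaited rather than fire-and-forget
            channel.Dispose();
        }
    }
}
```

Because the shutdown is awaited, a test can assert on the channel state immediately after `DisposeAsync()` completes, which is harder to do reliably with a fire-and-forget task.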
Pull request overview
Fixes a gRPC channel/handle leak in the AzureManaged Durable Task Scheduler integration by caching GrpcChannel instances and disposing them when the DI container is disposed.
Changes:
- Cache GrpcChannel instances inside ConfigureGrpcChannel using ConcurrentDictionary<..., Lazy<GrpcChannel>>.
- Make the channel configurators disposable so cached channels are cleaned up on ServiceProvider disposal.
- Add tests intended to validate channel reuse and disposal behavior for both client and worker extensions.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/Client/AzureManaged/DurableTaskSchedulerClientExtensions.cs | Adds per-configuration GrpcChannel caching and async disposal logic for client-side options configuration. |
| src/Worker/AzureManaged/DurableTaskSchedulerWorkerExtensions.cs | Adds per-configuration GrpcChannel caching and async disposal logic for worker-side options configuration. |
| test/Client/AzureManaged.Tests/DurableTaskSchedulerClientExtensionsTests.cs | Adds tests for channel reuse/isolation and disposal on DI container teardown for client extensions. |
| test/Worker/AzureManaged.Tests/DurableTaskSchedulerWorkerExtensionsTests.cs | Adds tests for channel reuse/isolation and disposal on DI container teardown for worker extensions. |
sealed class ConfigureGrpcChannel : IConfigureNamedOptions<GrpcDurableTaskWorkerOptions>, IAsyncDisposable
{
    readonly IOptionsMonitor<DurableTaskSchedulerWorkerOptions> schedulerOptions;
    readonly ConcurrentDictionary<string, Lazy<GrpcChannel>> channels = new();
    int disposed;
Copilot AI, Jan 27, 2026
PR description mentions IDisposable and a volatile disposed flag, but this implementation is IAsyncDisposable-only and the disposed field isn’t volatile (and is read without Volatile.Read). Either update the PR description or adjust the implementation to match (e.g., implement IDisposable and use volatile/Volatile.Read for the disposed check).
sealed class ConfigureGrpcChannel : IConfigureNamedOptions<GrpcDurableTaskClientOptions>, IAsyncDisposable
{
    readonly IOptionsMonitor<DurableTaskSchedulerClientOptions> schedulerOptions;
    readonly ConcurrentDictionary<string, Lazy<GrpcChannel>> channels = new();
    int disposed;
Copilot AI, Jan 27, 2026
PR description mentions IDisposable and a volatile disposed flag, but this implementation is IAsyncDisposable-only and the disposed field isn’t volatile (and is read without Volatile.Read). Either update the PR description or adjust the implementation to match (e.g., implement IDisposable and use volatile/Volatile.Read for the disposed check).
ServiceProvider provider = services.BuildServiceProvider();

// Resolve options multiple times to trigger channel configuration
IOptionsMonitor<GrpcDurableTaskClientOptions> optionsMonitor = provider.GetRequiredService<IOptionsMonitor<GrpcDurableTaskClientOptions>>();
GrpcDurableTaskClientOptions options1 = optionsMonitor.Get(Options.DefaultName);
GrpcDurableTaskClientOptions options2 = optionsMonitor.Get(Options.DefaultName);
Copilot AI, Jan 27, 2026
This test calls IOptionsMonitor.Get() twice; OptionsMonitor caches options per name, so the second call typically returns the same options instance and won’t re-run Configure (so it doesn’t validate the new channel-caching behavior). Consider forcing new options creation (e.g., via IOptionsFactory/IOptionsSnapshot scopes) and disposing the ServiceProvider to avoid leaking channels/handlers during the test run.
Suggested change, from:
ServiceProvider provider = services.BuildServiceProvider();
// Resolve options multiple times to trigger channel configuration
IOptionsMonitor<GrpcDurableTaskClientOptions> optionsMonitor = provider.GetRequiredService<IOptionsMonitor<GrpcDurableTaskClientOptions>>();
GrpcDurableTaskClientOptions options1 = optionsMonitor.Get(Options.DefaultName);
GrpcDurableTaskClientOptions options2 = optionsMonitor.Get(Options.DefaultName);
to:
using ServiceProvider provider = services.BuildServiceProvider();
// Resolve options multiple times to trigger channel configuration
IOptionsFactory<GrpcDurableTaskClientOptions> optionsFactory = provider.GetRequiredService<IOptionsFactory<GrpcDurableTaskClientOptions>>();
GrpcDurableTaskClientOptions options1 = optionsFactory.Create(Options.DefaultName);
GrpcDurableTaskClientOptions options2 = optionsFactory.Create(Options.DefaultName);
ServiceProvider provider = services.BuildServiceProvider();

// Resolve options multiple times to trigger channel configuration
IOptionsMonitor<GrpcDurableTaskWorkerOptions> optionsMonitor = provider.GetRequiredService<IOptionsMonitor<GrpcDurableTaskWorkerOptions>>();
GrpcDurableTaskWorkerOptions options1 = optionsMonitor.Get(Options.DefaultName);
GrpcDurableTaskWorkerOptions options2 = optionsMonitor.Get(Options.DefaultName);
Copilot AI, Jan 27, 2026
This test calls IOptionsMonitor.Get() twice; OptionsMonitor caches options per name, so the second call typically returns the same options instance and won’t re-run Configure (so it doesn’t validate the new channel-caching behavior). Consider forcing new options creation (e.g., via IOptionsFactory/IOptionsSnapshot scopes) and disposing the ServiceProvider to avoid leaking channels/handlers during the test run.
Suggested change, from:
ServiceProvider provider = services.BuildServiceProvider();
// Resolve options multiple times to trigger channel configuration
IOptionsMonitor<GrpcDurableTaskWorkerOptions> optionsMonitor = provider.GetRequiredService<IOptionsMonitor<GrpcDurableTaskWorkerOptions>>();
GrpcDurableTaskWorkerOptions options1 = optionsMonitor.Get(Options.DefaultName);
GrpcDurableTaskWorkerOptions options2 = optionsMonitor.Get(Options.DefaultName);
to:
using ServiceProvider provider = services.BuildServiceProvider();
// Resolve options multiple times to trigger channel configuration via new options instances
IOptionsFactory<GrpcDurableTaskWorkerOptions> optionsFactory = provider.GetRequiredService<IOptionsFactory<GrpcDurableTaskWorkerOptions>>();
GrpcDurableTaskWorkerOptions options1 = optionsFactory.Create(Options.DefaultName);
GrpcDurableTaskWorkerOptions options2 = optionsFactory.Create(Options.DefaultName);
// Act - configure two different named workers with different endpoints
mockBuilder1.Object.UseDurableTaskScheduler("endpoint1.westus3.durabletask.io", ValidTaskHub, credential);
mockBuilder2.Object.UseDurableTaskScheduler("endpoint2.westus3.durabletask.io", ValidTaskHub, credential);
ServiceProvider provider = services.BuildServiceProvider();
Copilot AI, Jan 27, 2026
This test uses different endpoints for different named options, so it will pass even if the cache key accidentally ignores the options name. To validate name isolation in the cache key, use the same endpoint/task hub for both names and assert the channels differ; also dispose the ServiceProvider to avoid leaking channels.
Suggested change, from:
// Act - configure two different named workers with different endpoints
mockBuilder1.Object.UseDurableTaskScheduler("endpoint1.westus3.durabletask.io", ValidTaskHub, credential);
mockBuilder2.Object.UseDurableTaskScheduler("endpoint2.westus3.durabletask.io", ValidTaskHub, credential);
ServiceProvider provider = services.BuildServiceProvider();
to:
// Act - configure two different named workers with the same endpoint and task hub
mockBuilder1.Object.UseDurableTaskScheduler("endpoint.westus3.durabletask.io", ValidTaskHub, credential);
mockBuilder2.Object.UseDurableTaskScheduler("endpoint.westus3.durabletask.io", ValidTaskHub, credential);
using ServiceProvider provider = services.BuildServiceProvider();
if (exceptions is { Count: > 0 })
{
    throw new AggregateException(exceptions);
}
Copilot AI, Jan 27, 2026
DisposeAsync throws an AggregateException when any channel shutdown/dispose fails. Throwing from ServiceProvider disposal can surface as app shutdown failures and is difficult for callers to handle. Consider making this best-effort (swallow/log disposal errors) instead of throwing.
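The best-effort approach suggested in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `BestEffortCleanup`, `RunAll`, and the `onError` callback are hypothetical names, and the callback merely stands in for whatever logging the real code would use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: runs every cleanup action and reports failures
// to a callback instead of rethrowing them from disposal.
public static class BestEffortCleanup
{
    public static int RunAll(IEnumerable<Action> cleanups, Action<Exception> onError)
    {
        int failures = 0;
        foreach (Action cleanup in cleanups)
        {
            try
            {
                cleanup();
            }
            catch (Exception ex)
            {
                // A failed cleanup must not prevent the remaining ones from running.
                failures++;
                onError(ex);
            }
        }

        return failures;
    }
}
```

The trade-off is that callers no longer learn about disposal failures through an exception, so the callback (or a log) becomes the only place those failures are visible.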
// Act - configure two different named clients with different endpoints
mockBuilder1.Object.UseDurableTaskScheduler("endpoint1.westus3.durabletask.io", ValidTaskHub, credential);
mockBuilder2.Object.UseDurableTaskScheduler("endpoint2.westus3.durabletask.io", ValidTaskHub, credential);
ServiceProvider provider = services.BuildServiceProvider();
Copilot AI, Jan 27, 2026
This test uses different endpoints for different named options, so it will pass even if the cache key accidentally ignores the options name. To validate name isolation in the cache key, use the same endpoint/task hub for both names and assert the channels differ; also dispose the ServiceProvider to avoid leaking channels.
// Create a cache key based on the options name, endpoint, and task hub.
// This ensures channels are reused for the same configuration
// but separate channels are created for different configurations.
string cacheKey = $"{optionsName}:{source.EndpointAddress}:{source.TaskHubName}";
Copilot AI, Jan 27, 2026
The channel cache key is built by concatenating strings with ':' delimiters, but EndpointAddress commonly contains ':' (e.g., "https://" or ports). This can create ambiguous keys and potential collisions. Prefer a composite key type (e.g., ValueTuple/record struct) rather than a delimited string.
Suggested change, from:
string cacheKey = $"{optionsName}:{source.EndpointAddress}:{source.TaskHubName}";
to:
// Use a delimiter character (\u001F) that will not appear in typical endpoint URIs.
string cacheKey = $"{optionsName}\u001F{source.EndpointAddress}\u001F{source.TaskHubName}";
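To make the collision risk raised in the review comment concrete, here is a small sketch comparing a ':'-delimited string key with a composite tuple key. `CacheKeyDemo` and its method names are hypothetical and exist only for illustration; the string format mirrors the one in the original diff.

```csharp
using System;

// Hypothetical illustration of key ambiguity; not code from the PR.
public static class CacheKeyDemo
{
    // Delimited-string key in the same shape as the original diff.
    public static string StringKey(string optionsName, string endpoint, string taskHub)
        => $"{optionsName}:{endpoint}:{taskHub}";

    // Composite key: each component is compared on its own, so no delimiter is needed.
    public static (string OptionsName, string Endpoint, string TaskHub) TupleKey(
        string optionsName, string endpoint, string taskHub)
        => (optionsName, endpoint, taskHub);
}
```

With these helpers, `StringKey("a", "b:c", "d")` and `StringKey("a:b", "c", "d")` both produce `a:b:c:d`, while the corresponding `TupleKey` values compare as different. A value tuple can be used directly as a `ConcurrentDictionary` key because it implements value equality and hashing over its components.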
string optionsName = name ?? Options.DefaultName;
DurableTaskSchedulerWorkerOptions source = this.schedulerOptions.Get(optionsName);

// Create a cache key based on the options name, endpoint, and task hub.
// This ensures channels are reused for the same configuration
Copilot AI, Jan 27, 2026
CreateChannel() behavior depends on more than endpoint/task hub (e.g., ResourceId, Credential, AllowInsecureCredentials, and WorkerId via the call credentials interceptor). If any of these values change while EndpointAddress/TaskHubName stay the same (e.g., via options reload), the cached channel will be reused with stale settings. Consider including these fields in the cache key or enforcing immutability for them.
if (exceptions is { Count: > 0 })
{
    throw new AggregateException(exceptions);
}
Copilot AI, Jan 27, 2026
DisposeAsync throws an AggregateException when any channel shutdown/dispose fails. Throwing from ServiceProvider disposal can surface as app shutdown failures and is difficult for callers to handle. Consider making this best-effort (swallow/log disposal errors) instead of throwing.
Summary
What changed?
- ConfigureGrpcChannel class using ConcurrentDictionary<string, Lazy<GrpcChannel>>
- ConfigureGrpcChannel implement IDisposable to properly dispose channels when the ServiceProvider is disposed

Why is this change needed?
- IConfigureNamedOptions<DurableTaskSchedulerOptions>.Configure() method was being called multiple times (on each retry)
- GrpcChannel, which allocates internal HTTP handlers and socket connections
- options.Channel is pre-set (AzureManaged case), GetCallInvoker() returns default for the disposable reference

Issues / work items
Project checklist
- release_notes.md

AI-assisted code disclosure (required)
Was an AI tool used? (select one)
If AI was used:
- src/Client/AzureManaged/DurableTaskSchedulerClientExtensions.cs (modified)
- src/Worker/AzureManaged/DurableTaskSchedulerWorkerExtensions.cs (modified)
- test/Client/AzureManaged.Tests/DurableTaskSchedulerClientExtensionsTests.cs (modified)
- test/Worker/AzureManaged.Tests/DurableTaskSchedulerWorkerExtensionsTests.cs (modified)
- GrpcChannelCache class to inline ConcurrentDictionary<string, Lazy<GrpcChannel>> per reviewer feedback

AI verification (required if AI was used):
Testing
Automated tests
- test/Client/AzureManaged.Tests - 33 tests passed
- test/Worker/AzureManaged.Tests - 31 tests passed
- test/Shared/AzureManaged.Tests - 20 tests passed

Manual validation (only if runtime/behavior changed)
Notes for reviewers
- ConfigureGrpcChannel class now uses ConcurrentDictionary<string, Lazy<GrpcChannel>> for thread-safe channel caching
- Lazy<GrpcChannel> ensures thread-safe initialization without holding locks during channel creation (avoids potential deadlocks)
- ShutdownAsync() is called before Dispose() for graceful shutdown of in-flight RPCs
- volatile keyword on the disposed field ensures proper memory visibility when checking disposal state from multiple threads
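
The caching approach described in these notes follows a common ConcurrentDictionary plus Lazy<T> pattern. The sketch below shows that pattern in generic form; `LazyCache` and `GetOrCreate` are hypothetical names, and `TValue` stands in for GrpcChannel so the example needs no gRPC package. It is an illustration of the technique, not the code in this PR.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical generic cache illustrating the ConcurrentDictionary + Lazy<T> pattern.
public sealed class LazyCache<TKey, TValue>
    where TKey : notnull
{
    readonly ConcurrentDictionary<TKey, Lazy<TValue>> entries = new();

    public TValue GetOrCreate(TKey key, Func<TKey, TValue> factory)
    {
        // GetOrAdd may invoke its delegate more than once under contention,
        // but only one Lazy<TValue> is stored per key. ExecutionAndPublication
        // ensures the stored Lazy runs the factory at most once.
        Lazy<TValue> lazy = this.entries.GetOrAdd(
            key,
            k => new Lazy<TValue>(() => factory(k), LazyThreadSafetyMode.ExecutionAndPublication));

        return lazy.Value;
    }
}
```

One caveat of this mode is that an exception thrown by the factory is cached by the Lazy<T> instance and rethrown on later accesses, so a failed creation would need its entry removed before a retry can succeed.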